Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

databases. Most programs used in this book are open-source Unix/Linux-based programs.

Others can be used in Anaconda environments.

Chapter 1 discusses sequencing data acquisition from NGS technologies and databases,

FASTQ file format, and Phred base call quality. The chapter covers the quality assessment

of FASTQ files and read quality metrics in some detail so that readers can diagnose

potential problems in raw data and learn how to fix any possible quality problem before

analysis.
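
As a brief illustration of the Phred base call quality mentioned above, the following is a minimal sketch, assuming the common Phred+33 ASCII encoding. The function names are hypothetical and are not taken from the book.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into Phred scores.

    Assumes Phred+33 encoding by default (an assumption; older data
    may use a different offset).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Example: decode a short, made-up quality string and report per-base error rates.
for q in phred_scores("II5+!"):
    print(q, error_probability(q))
```

A Phred score of 20 corresponds to an error probability of 1 in 100, and a score of 40 to 1 in 10,000, which is why low-scoring bases are usually trimmed or filtered before analysis.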

Chapter 2 discusses read alignment/mapping to reference genomes. The strategies of

both reference genome indexing algorithms and read mapping algorithms are discussed in

detail with illustrations so that the readers can understand how the mapping process works,

the different indexing and alignment algorithms currently used, and which aligners are

good for RNA sequencing applications. The chapter discusses indexing and searching

algorithms like suffix trees, suffix arrays, the Burrows-Wheeler Transform (BWT), FM-index,

and hashing, which are the algorithms used by aligners. The chapter then discusses the

mapping process and aligners like BWA, Bowtie, STAR, etc. The SAM/BAM file format

is discussed in detail so that the reader can understand how alignment information is

stored in fields in the SAM/BAM file. Finally, the chapter discusses the manipulation of

alignments in SAM/BAM files using Samtools programs for different purposes, including

SAM to BAM conversion, alignment sorting, indexing BAM files, extracting alignments of

a chromosome or a specific region, filtering and counting alignments, removing duplicate

reads, and generating descriptive statistics.
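
To give a sense of the Burrows-Wheeler Transform named above, the following is a naive sketch based on sorting all rotations of the text. It is for illustration only: real aligners build the transform from a suffix array and never materialize the rotations. The function names and the "$" terminator are assumptions made for this example.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations.

    Assumes the terminator does not occur in the text and sorts before
    every other character.
    """
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Invert the transform by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The original text is the row that ends with the terminator.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("banana")))    # banana
```

The transform tends to group identical characters together, which is what makes it compressible and, combined with the FM-index, searchable.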

Chapter 3 discusses de novo genome assembly and de novo assembly algorithms, including

the greedy algorithm, overlap-consensus graphs, and de Bruijn graphs. The quality assessment

of the assembled genome is discussed through two approaches: a statistical approach

and an evolutionary approach.
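
The de Bruijn graph idea mentioned above can be sketched in a few lines, assuming error-free reads and a hypothetical function name. Each k-mer contributes an edge from its (k-1)-length prefix to its (k-1)-length suffix; an assembler would then look for paths through this graph.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph as an adjacency list.

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge
    from its prefix to its suffix. Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Example with two overlapping, made-up reads and k = 3.
for node, successors in de_bruijn_graph(["ACGTC", "GTCA"], 3).items():
    print(node, "->", successors)
```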

Chapter 4 covers variant calling (SNPs and InDels) in detail. The introduction of this

chapter discusses variants, the Variant Call Format (VCF), and the general workflow of variant

calling. The chapter then discusses both consensus-based and haplotype-based variant

calling, with example callers from each group, including BCFtools, FreeBayes, and the

GATK Best Practices variant calling pipelines. Finally, the chapter discusses variant

annotation and prioritization, along with annotation programs including SIFT, SnpEff,

and ANNOVAR.

Chapter 5 discusses RNA-Seq data analysis. The introduction includes RNA-Seq basics

and applications. The chapter then discusses the steps of the RNA-Seq analysis workflow,

including data acquisition, read alignment, alignment quality control, quantification,

RNA-Seq data normalization, statistical modeling and differential expression analysis,

using R packages for differential analysis, and visualization of RNA-Seq data.

Chapter 6 covers ChIP-Seq data analysis. It discusses in detail the workflow of ChIP-Seq

data analysis, including data acquisition, quality control, read mapping, peak calling,

visualizing peak enrichment and peak distribution, peak annotation, peak functional analysis,

and motif discovery.

Chapter 7 discusses targeted gene metagenomic data analysis (amplicon-based microbial

analysis) for environmental and clinical samples. The chapter covers raw data preprocessing